PubMed: Finding Similar Authors

Here we are interested in identifying similar researchers based on content they have published. We will be using PubMed articles by researchers from the University of New Mexico College of Pharmacy and University of New Mexico School of Medicine.

Methods

  1. Download article data from PubMed
    1. From pubmed.gov find articles published by the College of Pharmacy using the following search query "college of pharmacy"[AD] AND "new mexico"[AD] and save results to .csv
    2. From pubmed.gov find articles published by the School of Medicine using the following search query "school of medicine"[AD] AND "new mexico"[AD] and save results to .csv
  2. For each author, append each article title they were associated with to a single string
  3. Compute an Author-Term matrix which contains the term frequencies for each author as computed from article titles
  4. Create a Term Frequency-Inverse Document Frequency (TF-IDF) matrix to identify author-specific keywords
  5. Use hierarchical clustering to cluster authors based on keyword similarity
In [2]:
# load required packages
require(dplyr)
require(data.table)
require(tidyr)
require(tm)
require(ggplot2)
require(ape)
require(reshape2)
options(repr.plot.width = 10, repr.plot.height=10)

Creating the dataset

After each of the article sets are downloaded from pubmed, they can be read and used to create a single dataframe which contains the author and article title. In the code below we also create a separate column named affiliation which shows which school the author is affiliated with.

In [3]:
# read in the data containing authors and article titles for both the college of pharmacy and school of medicine
cop = fread("Downloads/cop.csv")
som = fread("Downloads/som.csv")

# label author affiliation
cop$affiliation = 'Pharmacy'
som$affiliation = 'Medicine'

# combine into single dataframe`
pubmed = rbind(cop, som)
head(pubmed, 1)
Out[3]:
TitleURLDescriptionDetailsShortDetailsResourceTypeIdentifiersDbEntrezUIDPropertiesaffiliation
1'Ethical responsibility' or 'a whole can of worms': differences in opinion on incidental finding review and disclosure in neuroimaging research from focus group discussions with participants, parents, IRB members, investigators, physicians and community members./pubmed/26063579Cole C, Petree LE, Phillips JP, Shoemaker JM, Holdsworth M, Helitzer DL.J Med Ethics. 2015 Oct;41(10):841-7. doi: 10.1136/medethics-2014-102552. Epub 2015 Jun 10.J Med Ethics. 2015PubMedcitationPMID:26063579pubmed26063579create date:2015/06/13 | first author:Cole CPharmacy

In the output above, we can see that the first article has 6 authors associated with it. We will need to standardize the author names by converting to upper case and then split on ',' in order to separate them.

In [4]:
# Each article contains multiple authors
pubmed = mutate(pubmed, Description = strsplit(toupper(Description), ","))

Finally we will need to split the list apart so that each author can be separated from the rest. This can be done using the unnest() function from the tidyr package. Once we are finished with this step we should have a dataframe with each row as an author.

In [5]:
pubmed = pubmed %>% unnest(Description)
head(pubmed,2)
Out[5]:
TitleURLDetailsShortDetailsResourceTypeIdentifiersDbEntrezUIDPropertiesaffiliationDescription
1'Ethical responsibility' or 'a whole can of worms': differences in opinion on incidental finding review and disclosure in neuroimaging research from focus group discussions with participants, parents, IRB members, investigators, physicians and community members./pubmed/26063579J Med Ethics. 2015 Oct;41(10):841-7. doi: 10.1136/medethics-2014-102552. Epub 2015 Jun 10.J Med Ethics. 2015PubMedcitationPMID:26063579pubmed26063579create date:2015/06/13 | first author:Cole CPharmacyCOLE C
2'Ethical responsibility' or 'a whole can of worms': differences in opinion on incidental finding review and disclosure in neuroimaging research from focus group discussions with participants, parents, IRB members, investigators, physicians and community members./pubmed/26063579J Med Ethics. 2015 Oct;41(10):841-7. doi: 10.1136/medethics-2014-102552. Epub 2015 Jun 10.J Med Ethics. 2015PubMedcitationPMID:26063579pubmed26063579create date:2015/06/13 | first author:Cole CPharmacy PETREE LE

In this output above we can see that the author list has been split apart and each row now contains one of the authors of the article along with the title and other meta data. Next we will need to group by the author column (Description) and append all of the titles together into a single string. This can be done using the dplyr package to group the data and simple user defined function clean.text which converts the text to upper case and removes all non-alpha and non-white space characters.

In [6]:
clean.text = function(x){
    gsub('[^A-Za-z ]', '', toupper(x))
}

# group by author and concat title text into single text string and select the top
# 250 authors based on total number of publications
pubmed = pubmed %>%
  mutate(Author = trimws(clean.text(Description))) %>%
  group_by(Author) %>%
  summarize(title.text = clean.text(paste(Title, collapse = ' ')),
            pub.num = n(),
            affiliation = names(sort(table(affiliation), decreasing=T)[1])) %>% # keep the most common affiliation
  filter(Author != "ET AL") %>%
  arrange(desc(pub.num)) %>%
  head(200)

dim(pubmed)

head(pubmed %>% select(Author, title.text),1)
Out[6]:
  1. 200
  2. 4
Out[6]:
Authortitle.text
1LIU KJAPPLICATION OF IN VIVO EPR IN BRAIN RESEARCH MONITORING TISSUE OXYGENATION BLOOD FLOW AND OXIDATIVE STRESS ARSENITE BINDINGINDUCED ZINC LOSS FROM PARP IS EQUIVALENT TO ZINC DEFICIENCY IN REDUCING PARP ACTIVITY LEADING TO INHIBITION OF DNA REPAIR ARSENITE CAUSES DNA DAMAGE IN KERATINOCYTES VIA GENERATION OF HYDROXYL RADICALS ARSENITE INTERACTS SELECTIVELY WITH ZINC FINGER PROTEINS CONTAINING CH OR C MOTIFS ARSENITE INTERACTS WITH DIBENZODEFPCHRYSENE DBC AT LOW LEVELS TO SUPPRESS BONE MARROW LYMPHOID PROGENITORS IN MICE ARSENITE SELECTIVELY INHIBITS MOUSE BONE MARROW LYMPHOID PROGENITOR CELL DEVELOPMENT IN VIVO AND IN VITRO AND SUPPRESSES HUMORAL IMMUNITY IN VIVO AUF MEDIATES INHIBITION BY NITRIC OXIDE OF LIPOPOLYSACCHARIDEINDUCED MATRIX METALLOPROTEINASE EXPRESSION IN CULTURED ASTROCYTES BENZOAPYRENE QUINONES INCREASE CELL PROLIFERATION GENERATE REACTIVE OXYGEN SPECIES AND TRANSACTIVATE THE EPIDERMAL GROWTH FACTOR RECEPTOR IN BREAST EPITHELIAL CELLS CEREBRAL TISSUE OXYGENATION AND OXIDATIVE BRAIN INJURY DURING ISCHEMIA AND REPERFUSION COMPARISON OF TWO NITROXIDE LABILE ESTERS FOR DELIVERING ELECTRON PARAMAGNETIC RESONANCE PROBES INTO MOUSE BRAIN CONFERENCE SUMMARY AND RECENT ADVANCES THE TH CONFERENCE ON METAL TOXICITY AND CARCINOGENESIS CONTRIBUTIONS OF REACTIVE OXYGEN SPECIES AND MITOGENACTIVATED PROTEIN KINASE SIGNALING IN ARSENITESTIMULATED HEMEOXYGENASE PRODUCTION CORRIGENDUM TO IMMUNOTOXICITY AND BIODISTRIBUTION ANALYSIS OF ARSENIC TRIOXIDE IN CB MICE FOLLOWING A WEEK INHALATION EXPOSURE TOXICOL APPL PHARMACOL DIFFERENTIAL BINDING OF MONOMETHYLARSONOUS ACID COMPARED TO ARSENITE AND ARSENIC TRIOXIDE WITH ZINC FINGER PEPTIDES AND PROTEINS DIFFERENTIAL EXPRESSION OF TISSUE INHIBITOR OF METALLOPROTEINASES IN CULTURED ASTROCYTES AND NEURONS REGULATES THE ACTIVATION OF MATRIX METALLOPROTEINASE DIRECT VISUALIZATION OF MOUSE BRAIN OXYGEN DISTRIBUTION BY ELECTRON PARAMAGNETIC RESONANCE IMAGING APPLICATION TO FOCAL CEREBRAL ISCHEMIA DIRECT VISUALIZATION OF TRAPPED ERYTHROCYTES IN RAT BRAIN AFTER FOCAL ISCHEMIA AND REPERFUSION DOES NORMOBARIC HYPEROXIA INCREASE OXIDATIVE STRESS IN ACUTE ISCHEMIC STROKE A CRITICAL REVIEW OF THE LITERATURE DUAL ACTIONS INVOLVED IN ARSENITEINDUCED OXIDATIVE DNA DAMAGE EBSELEN INDUCED C GLIOMA CELL DEATH IN OXYGEN AND GLUCOSE DEPRIVATION EFFECT OF PHENYLEPHRINE PRETREATMENT ON THE EXPRESSIONS OF AQUAPORIN AND CJUN NTERMINAL KINASE IN IRRADIATED SUBMANDIBULAR GLAND EFFECTS OF GLUCOSE CONCENTRATION ON REDOX STATUS IN RAT PRIMARY CORTICAL NEURONS UNDER HYPOXIA ELECTRON PARAMAGNETIC RESONANCEGUIDED NORMOBARIC HYPEROXIA TREATMENT PROTECTS THE BRAIN BY MAINTAINING PENUMBRAL OXYGENATION IN A RAT MODEL OF TRANSIENT FOCAL CEREBRAL ISCHEMIA ENHANCED ROS PRODUCTION AND REDOX SIGNALING WITH COMBINED ARSENITE AND UVA EXPOSURE CONTRIBUTION OF NADPH OXIDASE ENVIRONMENTALLY RELEVANT CONCENTRATIONS OF ARSENITE AND MONOMETHYLARSONOUS ACID INHIBIT ILSTAT CYTOKINE SIGNALING PATHWAYS IN MOUSE CDCDCD DOUBLE NEGATIVE THYMUS CELLS ENVIRONMENTALLY RELEVANT CONCENTRATIONS OF ARSENITE INDUCE DOSEDEPENDENT DIFFERENTIAL GENOTOXICITY THROUGH POLYADPRIBOSE POLYMERASE INHIBITION AND OXIDATIVE STRESS IN MOUSE THYMUS CELLS EVALUATION OF SPIN TRAPPING AGENTS AND TRAPPING CONDITIONS FOR DETECTION OF CELLGENERATED REACTIVE OXYGEN SPECIES EXTENDED NORMOBARIC HYPEROXIA THERAPY YIELDS GREATER NEUROPROTECTION FOR FOCAL TRANSIENT ISCHEMIAREPERFUSION IN RATS GENERATION OF HYDROGEN PEROXIDE DURING BRIEF OXYGENGLUCOSE DEPRIVATION INDUCES PRECONDITIONING NEURONAL PROTECTION IN PRIMARY CULTURED NEURONS GLUCOSE UPREGULATES HIF ALPHA EXPRESSION IN PRIMARY CORTICAL NEURONS IN RESPONSE TO HYPOXIA THROUGH MAINTAINING CELLULAR REDOX STATUS HYDROETHIDINE DETECTION OF SUPEROXIDE PRODUCTION DURING THE LITHIUMPILOCARPINE MODEL OF STATUS EPILEPTICUS HYDROXYL RADICAL FORMATION IS GREATER IN STRIATAL CORE THAN IN PENUMBRA IN A RAT MODEL OF ISCHEMIC STROKE IMMUNOTOXICITY AND BIODISTRIBUTION ANALYSIS OF ARSENIC TRIOXIDE IN CBL MICE FOLLOWING A WEEK INHALATION EXPOSURE IN VIVO EVIDENCE OF METHAMPHETAMINE INDUCED ATTENUATION OF BRAIN TISSUE OXYGENATION AS MEASURED BY EPR OXIMETRY IN VIVO REDUCTION OF CHROMIUM VI AND ITS RELATED FREE RADICAL GENERATION INDUCTION OF HEME OXYGENASE BY ARSENITE INHIBITS CYTOKINEINDUCED MONOCYTE ADHESION TO HUMAN ENDOTHELIAL CELLS INORGANIC ARSENIC COMPOUNDS CAUSE OXIDATIVE DAMAGE TO DNA AND PROTEIN BY INDUCING ROS AND RNS GENERATION IN HUMAN KERATINOCYTES INTERSTITIAL PO IN ISCHEMIC PENUMBRA AND CORE ARE DIFFERENTIALLY AFFECTED FOLLOWING TRANSIENT FOCAL CEREBRAL ISCHEMIA IN RATS LOW CONCENTRATION OF ARSENITE EXACERBATES UVRINDUCED DNA STRAND BREAKS BY INHIBITING PARP ACTIVITY LOWDOSE SYNERGISTIC IMMUNOSUPPRESSION OF TDEPENDENT ANTIBODY RESPONSES BY POLYCYCLIC AROMATIC HYDROCARBONS AND ARSENIC IN CBLJ MURINE SPLEEN CELLS MONOMETHYLARSONOUS ACID MMA INHIBITS IL SIGNALING IN MOUSE PREB CELLS NITRIC OXIDE INTERACTS WITH CAVEOLIN TO FACILITATE AUTOPHAGYLYSOSOMEMEDIATED CLAUDIN DEGRADATION IN OXYGENGLUCOSE DEPRIVATIONTREATED ENDOTHELIAL CELLS NORMOBARIC HYPEROXIA ATTENUATES EARLY BLOODBRAIN BARRIER DISRUPTION BY INHIBITING MMPMEDIATED OCCLUDIN DEGRADATION IN FOCAL CEREBRAL ISCHEMIA NORMOBARIC HYPEROXIA COMBINED WITH MINOCYCLINE PROVIDES GREATER NEUROPROTECTION THAN EITHER ALONE IN TRANSIENT FOCAL CEREBRAL ISCHEMIA NORMOBARIC HYPEROXIA DELAYS AND ATTENUATES EARLY NITRIC OXIDE PRODUCTION IN FOCAL CEREBRAL ISCHEMIC RATS NORMOBARIC HYPEROXIA INHIBITS NADPH OXIDASEMEDIATED MATRIX METALLOPROTEINASE INDUCTION IN CEREBRAL MICROVESSELS IN EXPERIMENTAL STROKE NORMOBARIC HYPEROXIA REDUCES THE NEUROVASCULAR COMPLICATIONS ASSOCIATED WITH DELAYED TISSUE PLASMINOGEN ACTIVATOR TREATMENT IN A RAT MODEL OF FOCAL CEREBRAL ISCHEMIA ON THE APPLICATION OF HYDROXYBENZOIC ACID AS A TRAPPING AGENT TO STUDY HYDROXYL RADICAL GENERATION DURING CEREBRAL ISCHEMIA AND REPERFUSION OXIDATIVE MECHANISM OF ARSENIC TOXICITY AND CARCINOGENESIS OXIDATIVE STRESS AND APOPTOSIS IN METAL IONINDUCED CARCINOGENESIS PEROXYNITRITE DECOMPOSITION CATALYST REDUCES DELAYED THROMBOLYSISINDUCED HEMORRHAGIC TRANSFORMATION IN ISCHEMIAREPERFUSED RAT BRAINS POLYADPRIBOSE CONTRIBUTES TO AN ASSOCIATION BETWEEN POLYADPRIBOSE POLYMERASE AND XERODERMA PIGMENTOSUM COMPLEMENTATION GROUP A IN NUCLEOTIDE EXCISION REPAIR POLYADPRIBOSE POLYMERASE INHIBITION BY ARSENITE PROMOTES THE SURVIVAL OF CELLS WITH UNREPAIRED DNA LESIONS INDUCED BY UV EXPOSURE REACTIONBASED OFFON FLUORESCENT PROBE ENABLING DETECTION OF ENDOGENOUS LABILE FE AND IMAGING OF ZNINDUCED FE FLUX IN LIVING CELLS AND ELEVATED FE IN ISCHEMIC STROKE REDUCTION OF ARSENITEENHANCED ULTRAVIOLET RADIATIONINDUCED DNA DAMAGE BY SUPPLEMENTAL ZINC REDUCTION OF ZINC ACCUMULATION IN MITOCHONDRIA CONTRIBUTES TO DECREASED CEREBRAL ISCHEMIC INJURY BY NORMOBARIC HYPEROXIA TREATMENT IN AN EXPERIMENTAL STROKE MODEL SELECTIVE SENSITIZATION OF ZINC FINGER PROTEIN OXIDATION BY REACTIVE OXYGEN SPECIES THROUGH ARSENIC BINDING SPATIOTEMPORAL EVOLUTION OF BLOOD BRAIN BARRIER DAMAGE AND TISSUE INFARCTION WITHIN THE FIRST H AFTER ISCHEMIA ONSET TISSUE OXYGEN IS REDUCED IN WHITE MATTER OF SPONTANEOUSLY HYPERTENSIVESTROKE PRONE RATS A LONGITUDINAL STUDY WITH ELECTRON PARAMAGNETIC RESONANCE USE OF ACETOXYMETHOXYCARBONYLTETRAMETHYLPYRROLIDINYLOXYL AS AN EPR OXIMETRY PROBE POTENTIAL FOR IN VIVO MEASUREMENT OF TISSUE OXYGENATION IN MOUSE BRAIN XANTHINE OXIDASE ACTIVATES PROMATRIX METALLOPROTEINASE IN CULTURED RAT VASCULAR SMOOTH MUSCLE CELLS THROUGH NONFREE RADICAL MECHANISMS ZNT EXPRESSION REDUCTION ENHANCES FREE ZINC ACCUMULATION IN ASTROCYTES AFTER ISCHEMIC STROKE

We now have a data frame where each row represents a unique author long with a concatenation of every title they have authored. Next we will need to create the Author-Term matrix, this can be done using the tm package.

Creat the Author-Term Matrix

Now that we have a data frame containing authors and concatenated titles, we can easily make an author-term matrix which contains the term frequencies for each author according to how frequently the terms were used in article titles.

In [7]:
# create the corpus using the concatenated article title text for each author
corpus = VCorpus(VectorSource(pubmed$title.text))
# remove stopwords
mycorp = tm_map(corpus, removeWords, stopwords('english'))
In [8]:
# create the author term matrix 
dtm = DocumentTermMatrix(mycorp)
rownames(dtm) = pubmed$Author

Create Term Frequency Inverse Document Frequency Matrix

The term frequency-inverse document frequency is a measure that reflects how important a word is to a particular document. It is essential the number of time a word appears in a document offset by the number of times the word appears throughout the corpus in general. In our case, it is the number of times an author uses a word in the title of their article offset by the number of times other authors use the same word in their articles.

In [9]:
# compute the tf-idf matrix
tfidf = weightTfIdf(dtm, normalize = T)

Since the TF-IDF weight reflects a words importance to a particular document (or in this case author) we can list the top 10 terms (ranked by TFIDF weight) for each author to see which terms are most descriptive of their research.

In [10]:
# Find the top 10 terms by TF-IDF for each author
top.terms = apply(tfidf, 1, function(x) colnames(tfidf)[order(x, decreasing = T)[1:10]])
top.terms[,1:5]
Out[10]:
LIU KJBURCHIEL SWHUDSON LGGLEW RHRAISCH DW
arsenite aromatic epidermalnigeria adverse
cerebral polycyclicfactor fatty methods
hyperoxia hydrocarbonsgrowth sickle drug
normobaricfollowing arsenite nigerian reactions
ischemia calcium receptor northern prescribing
zinc benzoapyreneovarian serum pediatric
brain cells dna childrenreview
focal cell cells fulanievents
ischemic dimethylbenzaanthracenearsenic disease oncology
tissue line zinc acids administrations

From the output above we can see that Dr. Jim Liu's research seems to focus on arsenite, ischemia, and the brain whereas Dr. Dennis Raisch's research is focused on adverse, events, and reactions this seems appropriate given Dr. Raisch's experience in mining adverse drug events from the FDA Adverse Event Reporting System.

Cluster Authors by TFIDF Weighted Article Terms

Now that we've compute the TFIDF weights for each word for each author, we can use hierarchical clustering to group authors based on title term similarity. Authors who use similar words in the titles of their papers will tend to cluster together. This will allow us to see interesting patters such which authors are publishing on similar topics. One thing to note is that if there are many more variables (i.e. terms) than documents (i.e. authors) non-sensical groups may emerge. In our case we have a total of 310 authors and 3,959 different terms/variables therefore we do not really have to worry about this problem.

In [11]:
hc = hclust(dist(tfidf))
color = pubmed$affiliation
color[color=="Pharmacy"] = "#64706c"
color[color=="Medicine"] = "#935347"
table(color)
Out[11]:
color
#64706c #935347 
    102      98 
In [12]:
par(mai=c(1,0,1,0))
plot(as.phylo(hc), main="PubMed Author Similarity", tip.color = color, type="fan", cex=0.7, label.offset = 0.05)

Inspecting the Groups

From the dendrogram above we can see clear groups emerge. Some groups cluster pretty tightly while other groups are less tightly connected. Some authors are identical, suggesting that their word vectors are identical. This seems to suggest that these authors co-occur on the same papers. In this next section we will write a function to extract the top 10 terms for a particular author.

In [13]:
# function for returning top 10 terms for particular author
get_top_terms = function(author, data, n = 10){
    colnames(data)[order(as.matrix(data[author,]), decreasing=T)[1:n]]
}

Let look at the top terms for some tightly clustered authors; for example GARVER WS, and JELINEK D

GARVER WS and JELINEK D

In [14]:
get_top_terms('GARVER WS', tfidf)
Out[14]:
  1. 'niemannpick'
  2. 'weight'
  3. 'dosage'
  4. 'highfat'
  5. 'cblj'
  6. 'interaction'
  7. 'decreased'
  8. 'diet'
  9. 'confirmation'
  10. 'features'
In [15]:
get_top_terms('JELINEK D', tfidf)
Out[15]:
  1. 'niemannpick'
  2. 'weight'
  3. 'dosage'
  4. 'highfat'
  5. 'cblj'
  6. 'interaction'
  7. 'decreased'
  8. 'diet'
  9. 'confirmation'
  10. 'features'

As expected, these authors have identical word vectors suggesting that they have similar research intersts. In fact, it is likely that these two authors are co-authors.

ANTONCULVER H and ABEN KK

In [16]:
get_top_terms('ANTONCULVER H', tfidf)
get_top_terms('ABEN KK', tfidf)
Out[16]:
  1. 'ovarian'
  2. 'cancer'
  3. 'risk'
  4. 'epithelial'
  5. 'eoc'
  6. 'common'
  7. 'gene'
  8. 'serous'
  9. 'genes'
  10. 'variants'
Out[16]:
  1. 'ovarian'
  2. 'cancer'
  3. 'risk'
  4. 'eoc'
  5. 'epithelial'
  6. 'common'
  7. 'gene'
  8. 'serous'
  9. 'genes'
  10. 'variants'

PAI MP and MERCIER RC

In [17]:
get_top_terms('PAI MP', tfidf)
get_top_terms('MERCIER RC', tfidf)
Out[17]:
  1. 'candida'
  2. 'bloodstream'
  3. 'dosing'
  4. 'antifungal'
  5. 'hemodialysis'
  6. 'receiving'
  7. 'obese'
  8. 'combinations'
  9. 'endocardial'
  10. 'simulated'
Out[17]:
  1. 'vancomycin'
  2. 'aureus'
  3. 'staphylococcus'
  4. 'methicillinresistant'
  5. 'against'
  6. 'antimicrobial'
  7. 'daptomycin'
  8. 'piperacillintazobactam'
  9. 'vancomycinintermediate'
  10. 'intravenous'

Its clear that the above authors frequently publish on infectious disease.

Opportunities for Collaboration

Now that we have successfully clustered authors into similar groups, we can potentially see which authors would collaborate well together. For example, lets look at Rey GM and Shah VO. I happen to know that Dr. Rey (College of Pharmacy) specializes in Diabetes as well as Dr. Shah (School of Medicine). In addition, both of these authors cluster near each other in the dendrogram. A quick review of the dataset shows that these authors both published a paper together Comparison of the fatty acid composition of the serum phospholipids of controls, prediabetics and adults with type 2 diabetes suggesting these authors have collaborated in the past. What other examples of interdisciplinary collaboration are apparent in the graph? I will leave that exercise to all of the intrepid readers out there.

Limitations

There are some limitations of this method, most notably is the fact that non-sensical groups will start to emerge when we have many more features (terms) than there are authors. In our case, we have only 200 authors compared to 3,644 terms so we do not have to worry. Another issues is how author affiliation was assigned. In this project I assigned the author to either "Pharmacy" or "Medicine" depending on which dataset the name appeared in most frequently. For example if an author published 4 papers that showed up in the College of Pharmacy search results and only 2 times in the School of Medicine results; they received the Pharmacy affiliation. This may misclassify authors into one of the two categories. Another limitation is the names used to identify the authors could change from paper to paper. For example, I may publish a paper using Bernauer ML one time and use Bernauer M the next. No efforts were made to correct for this and as a result we may not have correctly grouped the authors appropriately.